A central tension in our modeling is between explanation – good causal models – and prediction. In McElreath’s lecture, he builds the intuition that the models that predict best are often poor representations of the underlying causal structure. So the tools covered in this lecture should be considered tools for prediction, not for identifying causal models.
When trying to maximize prediction, we need to be wary of OVERFITTING – when the model learns too much from the sample. Methods for avoiding overfitting favor simpler models. We’ll make use of REGULARIZING, which helps stop the model from becoming too excited about any one data point. We’ll also discuss scoring devices, like INFORMATION CRITERIA and CROSS-VALIDATION.
But what if you hate carrying an umbrella? Let’s define the cost of getting wet as -5 happiness points and the cost of carrying an umbrella as -1 happiness point. Which weatherperson maximizes our happiness?
Finally, our previous measure of “accuracy” (we used a HIT RATE definition) is only one way to think about accuracy. What if accuracy is knowing the true data generating model? We might consider computing the probability of predicting the exact sequence of days (joint likelihood in Bayesian terms).
This method – LOG SCORING RULE, or the logarithm of the joint probability – is the way we’ll think about accuracy in terms of model prediction.
information theory
How do we measure distance from accuracy? One important thing to keep in mind is that some targets are easier to hit than others. Therefore, the key to measuring distance is to ask, “How much is our uncertainty reduced by learning an outcome?” This reduction is formally referred to as INFORMATION.
We need to formalize our measure of uncertainty. This measurement should:
be continuous;
increase as the number of possible events increases; and
be additive.
This is satisfied by the INFORMATION ENTROPY FUNCTION:
\[
H(p) = - \text{E log}(p_i) = -\sum^n_{i=1}p_i\text{log}(p_i)
\] In words: The uncertainty contained in a probability distribution is the average log-probability of an event.
Consider the weather forecast. When the day arrives, the weather is no longer uncertain. Suppose there were 3 rainy days and 7 sunny days out of 10, so the probabilities of rain and shine are 0.3 and 0.7. Therefore:
p <- c( 0.3 , 0.7 )
-sum( p * log(p) )
[1] 0.6108643
This is our uncertainty.
If we lived in Abu Dhabi, we might have a different probability of rain (let’s say .01), and therefore a different amount of uncertainty.
p <- c( 0.01 , 0.99 )
-sum( p * log(p) )
[1] 0.05600153
from entropy to accuracy
We define DIVERGENCE to be the additional uncertainty induced by using probabilities from one distribution to describe another distribution. This is known as the Kullback-Leibler divergence or simply KL divergence.
The divergence between a target \((p)\) and a model \((q)\) is defined as:
\[
D_{KL}(p,q) = \sum_ip_i(\text{log}(p_i) - \text{log}(q_i)) = \sum_ip_i\text{log}(\frac{p_i}{q_i})
\] In plainer language, the divergence is the average difference in log probability between the target and model.
What is the divergence when we get the model exactly correct with the target? Zero: when \(q = p\), every term \(\text{log}(p_i/q_i) = \text{log}(1) = 0\).
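As a quick numeric check (a minimal base-R sketch of our own, reusing the rain/shine probabilities from above), the divergence is zero when the model matches the target and grows as the model drifts away:

```r
p <- c( 0.3 , 0.7 )                      # target: probabilities of rain and shine
kl <- function( p , q ) sum( p * log( p / q ) )
kl( p , p )                              # model matches target exactly: divergence is 0
kl( p , c( 0.25 , 0.75 ) )               # a nearby model: small divergence
kl( p , c( 0.01 , 0.99 ) )               # a badly wrong model: large divergence
```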
At this point, you’re probably thinking, “How can I put this into practice? Divergence is measuring distance from truth, but I don’t know the truth.” But we won’t use divergence to estimate the difference from one model to truth; we’re only interested in using it to compare two models (q and r) to each other. All we need to know are each model’s average log probability: \(\text{E log}(q_i)\) and \(\text{E log}(r_i)\). To do so, we simply sum over all the observations, \(i\), yielding a total score for each model:
\[
S(q) = \sum_i\text{log}(q_i)
\]
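To make the score concrete, here is a small hand-rolled example (our own, not lecture code): scoring two candidate rain probabilities against a fixed sequence of ten days. Higher (less negative) totals are better:

```r
days <- c( 1 , 0 , 0 , 1 , 0 , 0 , 0 , 1 , 0 , 0 )   # 1 = rain, 0 = shine (3 rainy days)
log_score <- function( q ) sum( ifelse( days == 1 , log(q) , log(1 - q) ) )
log_score( 0.3 )    # model close to the observed rain frequency
log_score( 0.05 )   # overconfident "it almost never rains" model: more negative, so worse
```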
To compute this score for a Bayesian model, we use the entire posterior distribution. The score is known as the LOG-POINTWISE-PREDICTIVE-DENSITY (lppd).
Let’s see an example:
m7.1 <- quap(
  alist(
    brain_std ~ dnorm( mu , exp(log_sigma) ),
    mu <- a + b*mass_std,
    a ~ dnorm( 0.5 , 1 ),
    b ~ dnorm( 0 , 10 ),
    log_sigma ~ dnorm( 0 , 1 )
  ) , data=d )
lppd( m7.1 )
Each of these values is the log-probability score for a specific observation.
sum(lppd(m7.1))
[1] 2.480172
This is the total log-probability score for this model.
Sometimes, you’ll see people report deviance, which is just this value multiplied by -2 (for historical reasons). Because lppd is estimated by sampling from the posterior, repeated calls return slightly different values, which is why the result below is not exactly -2 times the sum above.
-2*sum(lppd(m7.1))
[1] -4.992729
One issue with the log-probability score is that it always improves as the model gets more complex. One way to address this is by calculating the log-probability out-of-sample.
Usually, when we have data and use it to fit a statistical model, the data comprise a TRAINING SAMPLE. Parameters are estimated from it, and then we can imagine using those estimates to predict outcomes in a new sample, called the TEST SAMPLE.
Using out-of-sample prediction doesn’t change the model; rather, it changes our estimate of the model’s predictive accuracy.
N <- 20
kseq <- 1:5
# this code will take a long time to run
dev <- sapply( kseq , function(k) {
  print(k)
  r <- mcreplicate( 1e4 , sim_train_test( N=N , k=k ) , mc.cores=4 )
  ## if using PC, use:
  # r <- replicate( 1e4 , sim_train_test( N=N , k=k ) )
  c( mean(r[1,]) , mean(r[2,]) , sd(r[1,]) , sd(r[2,]) )
} )
plot( 1:5 , dev[1,] , ylim=c( min(dev[1:2,])-5 , max(dev[1:2,])+10 ) ,
      xlim=c(1,5.1) , xlab="number of parameters" , ylab="deviance" ,
      pch=16 , col=rangi2 )
mtext( concat( "N = " , N ) )
points( (1:5)+0.1 , dev[2,] )
for ( i in kseq ) {
  pts_in  <- dev[1,i] + c(-1,+1)*dev[3,i]
  pts_out <- dev[2,i] + c(-1,+1)*dev[4,i]
  lines( c(i,i) , pts_in , col=rangi2 )
  lines( c(i,i)+0.1 , pts_out )
}
# Set up plotting area
plot( 0 , 0 , type="n" , xlim=c(-3,3) , ylim=c(0,2) ,
      xlab="parameter value" , ylab="Density" , main="" )
# Create x-values for plotting
x <- seq( -3 , 3 , length.out=1000 )
# Generate three density curves with different spreads
# Thick curve - very peaked (a tight, skeptical prior)
y1 <- dnorm( x , mean=0 , sd=0.3 )
y1 <- y1 * ( 2 / max(y1) )     # scale to match the peak height in the original
# Medium curve - moderate spread
y2 <- dnorm( x , mean=0 , sd=0.6 )
y2 <- y2 * ( 0.8 / max(y2) )
# Dashed curve - most spread (a flatter prior)
y3 <- dnorm( x , mean=0 , sd=1 )
y3 <- y3 * ( 0.4 / max(y3) )
# Add the curves to the plot
lines( x , y1 , lwd=3 )
lines( x , y2 , lwd=1 )
lines( x , y3 , lwd=1 , lty=2 )
Another tool in our toolbelt is regularizing priors. REGULARIZATION is a means of preventing the model from getting “too excited” by the training sample – that is, from fitting too closely to the specific patterns in that sample. There are many tools for regularization (ridge regression, lasso regression), but they all have the effect of shrinking regression parameters toward 0.
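As a toy illustration of that shrinkage (a base-R sketch of our own, not lecture code), the ridge slope for a single centered predictor has a closed form, and the penalty \(\lambda\) pulls the estimate toward 0:

```r
set.seed( 9 )
x <- rnorm( 30 ) ; x <- x - mean( x )             # centered predictor
y <- 0.5 * x + rnorm( 30 ) ; y <- y - mean( y )   # centered outcome
# closed-form ridge slope for one predictor: sum(xy) / (sum(x^2) + lambda)
ridge_beta <- function( lambda ) sum( x * y ) / ( sum( x^2 ) + lambda )
ridge_beta( 0 )     # lambda = 0: ordinary least squares
ridge_beta( 10 )    # heavier penalty: the estimate is shrunk toward 0
```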
As a Bayesian, you also have the tool of regularizing priors. As your priors become more “skeptical” (usually, closer to 0), your model will adapt less to the data. Be wary of having priors that are too tight, because you risk underfitting. (Of course, the more data you have, the less influence your priors have. But that shouldn’t concern you, because overfitting is less of a concern with larger datasets.)
cross-validation
One strategy for estimating predictive accuracy is to actually test the model’s predictive accuracy on another sample. This is known as CROSS-VALIDATION: we leave out a small chunk of observations from our sample, fit the model to the rest, and evaluate it on the observations that were left out.
We’re not actually going to leave out data – imagine! – so instead we’ll divide the data into chunks, or “folds”. The model will then predict each fold after being trained on all the other folds. This is known as k-fold validation. The minimum number of folds is 2, and the maximum is your sample size. The latter is referred to as LEAVE-ONE-OUT CROSS-VALIDATION (LOOCV), and this is extremely common.
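The fold assignment itself is simple. Here is a minimal base-R sketch (our own, not lecture code) of splitting 20 observations into 4 folds:

```r
set.seed( 11 )
n <- 20
folds <- sample( rep( 1:4 , length.out = n ) )   # randomly assign each row to a fold
for ( k in 1:4 ) {
  train_idx <- which( folds != k )   # fit the model on these rows...
  test_idx  <- which( folds == k )   # ...and score its log-probability on these
  # (model fitting and scoring would go here; sum the test scores across folds)
}
table( folds )   # each fold holds n/4 = 5 observations
```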
The problem with LOOCV is that it’s computationally intensive. Luckily, there are some clever maths that we can use to approximate the score we would get from running the model over and over. One approach is to use the importance (or weight) of each observation – that is, how much does the posterior distribution change if we were to remove this observation from our data? (Similar to influence and leverage.) Importantly, observations that are less likely are more important.
We can use importance in a strategy called PARETO-SMOOTHED IMPORTANCE SAMPLING CROSS-VALIDATION (PSIS). The Pareto part is a smoothing technique that improves the reliability of the importance weights. By assuming the weights follow a known distribution, the Pareto distribution, we can estimate a reliable cross-validation score without doing the work of actually cross-validating. We’ll get a PSIS score for each observation in our dataset, as well as a standard error for the score for the entire model.
A second approach is to use information criteria to compute the expected score out-of-sample. If you look back at the training/testing figure, you’ll find that the difference between training deviance and testing deviance is almost exactly twice the number of parameters in the model (e.g., 2 for the first model with 1 parameter, and about 10 for the last with 5 parameters). This is not a coincidence but a well-known result in machine learning, and we can exploit it for simple estimates of out-of-sample deviance.
A well-known estimate is the AKAIKE INFORMATION CRITERION (AIC):
\[
AIC = D_{\text{train}} + 2p = -2\text{lppd} + 2p
\] where \(p\) is the number of free parameters in the posterior distribution. As the 2 is just there for scaling, what AIC tells us is that the dimensionality of the posterior distribution is a natural measure of the model’s overfitting tendency. More complex models tend to overfit more, directly in proportion to the number of parameters.
AIC isn’t commonly used now. Its approximation is only reliable when:
Priors are flat or overwhelmed by likelihood (data).
The posterior distribution is approximately multivariate Gaussian.
The sample size \(N\) is much greater than the number of parameters \(k\).
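We can check the formula directly in base R (a sketch on simulated data, not from the lecture): for an ordinary linear model, computing \(-2\,\text{logLik} + 2p\) by hand reproduces what the built-in AIC() reports:

```r
set.seed( 3 )
x <- rnorm( 50 )
y <- 1 + 2 * x + rnorm( 50 )
fit <- lm( y ~ x )
p <- length( coef( fit ) ) + 1               # intercept, slope, plus sigma
aic_manual <- -2 * as.numeric( logLik( fit ) ) + 2 * p
c( aic_manual , AIC( fit ) )                 # the two values agree
```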
Similarly, the DEVIANCE INFORMATION CRITERION (DIC) doesn’t assume flat priors but does make the other two assumptions.
widely applicable
Watanabe’s WIDELY APPLICABLE INFORMATION CRITERION (WAIC) makes no assumption about the shape of the posterior. Its goal is to guess the out-of-sample KL divergence. In a large sample, the approximation converges to the cross-validation approximation, but in finite samples, there may be some disagreement.
Its calculation is the log-pointwise-predictive-density plus a penalty proportional to the variance in the posterior predictions:
\[
\text{WAIC}(y, \Theta) = -2(\text{lppd} - \sum_i\text{var}_{\theta}\text{log}p(y_i|\theta))
\] where \(y\) is the observations and \(\Theta\) is the posterior distribution. The penalty term means, “compute the variance in log-probabilities for each observation \(i\), and then sum up these variances to get the total penalty.”
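The formula translates directly into code. Here is a sketch (toy data and our own variable names) that computes the lppd and penalty terms from an \(S \times N\) matrix of pointwise log-likelihoods, with \(S\) posterior draws and \(N\) observations:

```r
set.seed( 5 )
S <- 1000 ; N <- 10
y  <- rnorm( N )                      # toy observations
mu <- rnorm( S , 0 , 0.2 )            # pretend posterior draws for the mean
# S x N matrix: log-likelihood of each observation under each posterior draw
loglik <- sapply( 1:N , function(i) dnorm( y[i] , mu , 1 , log = TRUE ) )
# lppd_i: log of the *average* likelihood across draws (not the average log-likelihood)
lppd_i    <- apply( loglik , 2 , function(ll) log( mean( exp( ll ) ) ) )
penalty_i <- apply( loglik , 2 , var )   # pointwise penalty: variance of the log-likelihood
waic <- -2 * ( sum( lppd_i ) - sum( penalty_i ) )
waic
```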
Like PSIS, WAIC is pointwise, meaning prediction is considered on a point-by-point basis. Therefore,
WAIC has an approximate standard error.
Some observations have stronger influence than others, and we can see this.
It can be hard to define for some models. Consider, for example, a time series, in which each prediction depends upon a previous observation: the previous observation becomes a predictor variable for the next one, so it’s not easy to think of each observation as independent or exchangeable. In such a case, you can of course compute WAIC as if each observation were independent of the others, but it’s not clear what the resulting value means.
comparing cross-validation, PSIS, WAIC
PSIS and WAIC perform very similarly in the context of ordinary linear models. Of course, they may not when our posterior distributions depart from Gaussian or when there are highly influential observations.
CV and PSIS have higher variance as estimators of the KL divergence, while WAIC has greater bias. So we should expect each to be slightly better in different contexts. However, in practice any advantage may be much smaller than the expected error. Watanabe recommends computing both WAIC and PSIS and contrasting them. If there are large differences, this implies one or both criteria are unreliable.
PSIS has a distinct advantage in warning the user about when it is unreliable. The \(k\) values that PSIS computes for each observation indicate when the PSIS score may be unreliable, as well as identify which observations are at fault. We’ll see later how useful this can be.
bic?
The Bayesian Information Criterion (BIC) is frequently compared with the Akaike Information Criterion (AIC). It’s important to understand that choosing between these criteria isn’t fundamentally about adopting a Bayesian perspective. Both criteria can be derived through either Bayesian or non-Bayesian approaches, and strictly speaking, neither is purely Bayesian.
BIC is mathematically connected to the logarithm of a linear model’s average likelihood. In Bayesian statistics, this average likelihood serves as the denominator in Bayes’ theorem—essentially the likelihood averaged across the prior distribution. Comparing average likelihoods has long been a standard method for model comparison in Bayesian analysis. These comparisons yield what we call “Bayes factors” when expressed as ratios. When transformed to a logarithmic scale, these ratios become differences, making them conceptually similar to comparing information criteria differences.
Since the average likelihood incorporates the prior distribution, models with more parameters naturally incur a complexity penalty. This helps prevent overfitting, although the exact penalty mechanism differs from that of information criteria. Many Bayesian statisticians have reservations about Bayes factors, and all acknowledge certain technical challenges. One significant obstacle is computational difficulty—calculating average likelihood is often complex. Even when posterior distributions can be successfully computed, estimating average likelihood may remain problematic. Another issue is that while weak priors might minimally impact posterior distributions within individual models, they can dramatically influence comparisons between different models.
The choice between Bayesian and non-Bayesian approaches doesn’t dictate whether to use information criteria or Bayes factors. In practice, there’s value in using both methods and examining where they align and diverge. However, remember that both information criteria and Bayes factors are purely predictive tools that will readily select confounded models without considering causation.
mid-lecture review
Regularizing priors—priors which are skeptical of extreme parameter values—reduce fit to sample but tend to improve predictive accuracy.
How do we choose between several plausible models when seeking to maximize accuracy?
Calculate a measure of information divergence.
Comparing accuracy within-sample is bad: we’ll always favor more complex models.
We can estimate out-of-sample accuracy with any of a number of techniques, most popularly cross-validation, PSIS, and WAIC.
Fit 3 models:
1. Divorce rate predicted from median age at marriage.
2. Divorce rate predicted from marriage rate.
3. Divorce rate predicted from both median age and marriage rate.
solution
m5.1 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bA * A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )

m5.2 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM * M ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )

m5.3 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )
Using the full model (both age and marriage rate), create a figure that plots the Pareto \(k\) estimate on the x-axis and the WAIC penalty on the y-axis, for each observation in the dataset. Label any outliers on this plot. The functions PSIS() and WAIC() will be helpful here, but remember to check the documentation.